Scraping many URLs?

Use the Batch API — it's the recommended path for anything more than a handful of URLs. scrape() is for one-off requests: debugging, webhooks that need an immediate answer, or single-URL health checks.

Scraping

Output formats

HTML (default)

result = client.scrape("https://example.com")
print(result.html)

Markdown

result = client.scrape("https://example.com", format="markdown")
print(result.markdown)

Convenience property

result.content returns whichever is available — markdown if present, otherwise HTML:

result = client.scrape("https://example.com", format="markdown")
print(result.content) # markdown

Downloading files (PDFs, images, binaries)

Since v0.6.0, client.scrape() also handles non-HTML responses — PDFs, images, ZIPs, and any other binary content. The response exposes three new fields (content_type, body_base64, body_url) and five helpers modelled on the requests library, so downloading a file is a one-liner:

resp = client.scrape("https://investors.example.com/quarterly-report.pdf", browser=False)
resp.save("quarterly-report.pdf")

Under the hood, resp.is_binary is True, resp.content_type is "application/pdf", and resp.body returns the decoded bytes. Use the is_binary flag to branch before reading text accessors:

resp = client.scrape(url)
if resp.is_binary:
    # PDF, image, ZIP, etc.: text accessors return None
    resp.save(f"out.{resp.content_type.split('/')[-1]}")
else:
    print(resp.content)  # markdown / html
    print(resp.statusCode)

Available accessors

| Accessor | Returns | When to use |
| --- | --- | --- |
| resp.is_binary | bool | Branch on binary vs text |
| resp.content_type | str \| None | MIME of the response ("application/pdf", "text/html; charset=utf-8", …) |
| resp.body | bytes | Always-bytes accessor (text gets UTF-8 encoded) |
| resp.text | str \| None | None for binary; safe text-only accessor |
| resp.save(path) | int | Write to disk, returns bytes written |
| resp.content | str \| None | Legacy text-only convenience; None for binary |
| resp.body_base64 | str \| None | Wire format; almost always use body instead |
| resp.body_url | str \| None | Reserved for future blob offload (>5 MB) |
| await resp.download_body() | bytes | Auto-detects inline vs offloaded |
| resp.download_body_sync() | bytes | Sync version of above |
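
The relationship between body_base64, text, and body described in the table can be modelled with the standard library alone. The sketch below is a toy illustration and not the SDK's code: decode_body is a hypothetical helper name, and it assumes that a binary response carries base64 text while a text response carries a plain string.

```python
import base64
from typing import Optional


def decode_body(body_base64: Optional[str], text: Optional[str]) -> bytes:
    """Hypothetical helper: return bytes from the base64 wire field or the text."""
    if body_base64 is not None:
        # Binary payloads travel as base64 text; decode them back to raw bytes.
        return base64.b64decode(body_base64)
    # Text responses are UTF-8 encoded so the caller always receives bytes.
    return (text or "").encode("utf-8")


pdf_wire = base64.b64encode(b"%PDF-1.7").decode("ascii")
print(decode_body(pdf_wire, None))       # b'%PDF-1.7'
print(decode_body(None, "<h1>Hi</h1>"))  # b'<h1>Hi</h1>'
```

In practice resp.body already does this for you, which is why the table recommends it over body_base64.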

Common patterns

Download every PDF a site lists:

import os

for pdf_url in catalog_pdf_urls:
    resp = client.scrape(pdf_url, browser=False)
    if resp.is_binary and resp.content_type == "application/pdf":
        resp.save(os.path.basename(pdf_url))
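
One caveat with os.path.basename on a raw URL: any query string or fragment ends up in the file name (for example "q1.pdf?download=1"). A minimal sketch of a safer alternative, using only the standard library, is shown below. filename_from_url is a hypothetical helper, not part of the SDK.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Hypothetical helper: derive a local file name from the URL path only."""
    # urlsplit separates the path from the query string and fragment.
    path = unquote(urlsplit(url).path)
    name = posixpath.basename(path)
    # Fall back when the path ends in "/" or resolves to a directory marker.
    if name in ("", ".", ".."):
        return default
    return name


print(filename_from_url("https://investors.example.com/reports/q1.pdf?download=1"))
# q1.pdf
```

In the loop above, resp.save(filename_from_url(pdf_url)) would then replace the os.path.basename call.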

Branch by MIME type:

ext = {"application/pdf": "pdf", "image/png": "png", "image/jpeg": "jpg"}
resp = client.scrape(url)
if resp.is_binary:
    suffix = ext.get(resp.content_type, "bin")
    resp.save(f"file.{suffix}")
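
The accessor table lists content_type values that can carry parameters, such as "text/html; charset=utf-8". An exact dictionary lookup would miss those, so it may be worth normalizing the value first. The following is a sketch under that assumption; extension_for and EXTENSIONS are illustrative names, not SDK members.

```python
from typing import Optional

EXTENSIONS = {"application/pdf": "pdf", "image/png": "png", "image/jpeg": "jpg"}


def extension_for(content_type: Optional[str], default: str = "bin") -> str:
    """Hypothetical helper: map a Content-Type value to a file extension."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=utf-8" and ignore letter case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(media_type, default)


print(extension_for("application/pdf"))            # pdf
print(extension_for("image/png; charset=binary"))  # png
print(extension_for(None))                         # bin
```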

Mix text and binary in a batch:

batch = client.submit_batch("daily", urls)
for r in batch.iter_results():
    if not r.guidance.success:
        continue
    if r.is_binary:
        save_blob(r.custom_id, r.body)
    else:
        save_html(r.custom_id, r.content)

scrape_many() and batch_scrape() work the same way — every yielded / returned ScrapeResponse exposes is_binary and friends.
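
The batch example calls save_blob and save_html without defining them; they are application code, not SDK functions. One possible shape for them is sketched below. The out_dir parameter, the "out" default directory, and the .bin/.html suffixes are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def save_blob(custom_id: str, data: bytes, out_dir: Path = Path("out")) -> Path:
    """Hypothetical helper: write binary content to <out_dir>/<custom_id>.bin."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{custom_id}.bin"
    path.write_bytes(data)
    return path


def save_html(custom_id: str, text: str, out_dir: Path = Path("out")) -> Path:
    """Hypothetical helper: write text content to <out_dir>/<custom_id>.html."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{custom_id}.html"
    path.write_text(text, encoding="utf-8")
    return path


# Demo against a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_blob("report-1", b"%PDF-1.7", out_dir=Path(tmp))
    print(saved.name)  # report-1.bin
```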

Browser rendering

For JavaScript-heavy sites (SPAs, React, Next.js), enable browser rendering:

result = client.scrape("https://spa-app.com", browser=True)

Automatic engine selection

When you pass browser=True, the API selects the best engine for each target domain automatically. You don't need to configure which browser to use — just ask for browser rendering and let the server route the request.

For harder sites (Google, Amazon, e-commerce sites with anti-bot protection), combine browser=True with retry_on_block and resource blocking to improve success rates:

# Standard — most JS sites
result = client.scrape("https://example.com", browser=True)

# Hard sites — retry on block + resource blocking
result = client.scrape(
    "https://www.google.com/shopping/...",
    browser=True,
    retry_on_block=True,
    block_resources=["image", "font", "media"],
)
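
If you do not know in advance how hard a site is, one option is to escalate through these settings only when a response looks blocked. The sketch below is one possible pattern, not a documented SDK feature. The step ordering, the looks_blocked heuristic (CAPTCHA flag or a 403/429 status), and fake_scrape are all assumptions; only the parameter and field names come from this page.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

# Option sets ordered from cheapest to most aggressive (ordering is an assumption).
ESCALATION_STEPS: List[Dict[str, Any]] = [
    {"browser": False},
    {"browser": True},
    {"browser": True, "retry_on_block": True,
     "block_resources": ["image", "font", "media"]},
]


def looks_blocked(result: Any) -> bool:
    """Heuristic (assumption): CAPTCHA flag set, or a 403/429 status code."""
    return bool(result.potentiallyBlockedByCaptcha) or result.statusCode in (403, 429)


def scrape_with_escalation(scrape: Callable[..., Any], url: str) -> Any:
    """Try each option set in order; return the first result that is not blocked."""
    result = None
    for options in ESCALATION_STEPS:
        result = scrape(url, **options)
        if not looks_blocked(result):
            break
    return result


def fake_scrape(url: str, **options: Any) -> SimpleNamespace:
    # Stand-in for client.scrape: pretends only the hardest step gets through.
    blocked = not options.get("retry_on_block", False)
    return SimpleNamespace(statusCode=403 if blocked else 200,
                           potentiallyBlockedByCaptcha=False, options=options)


result = scrape_with_escalation(fake_scrape, "https://example.com")
print(result.statusCode)  # 200
```

With a real client you would pass client.scrape as the first argument in place of fake_scrape.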

Proxy rotation

Proxy rotation is on by default (use_proxy="any"). Every request goes through a different IP.

# Default: automatic proxy rotation
result = client.scrape("https://example.com")

# Disable proxy
result = client.scrape("https://example.com", use_proxy=None)

# Country-specific proxy (requires approval)
result = client.scrape("https://example.com", use_proxy="US")

Screenshots

Capture a full-page screenshot (requires browser=True):

import base64

result = client.scrape("https://example.com", browser=True, screenshot=True)

with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
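
Since the screenshot arrives as a base64 PNG string, a cheap sanity check before writing it to disk is to confirm that the decoded bytes begin with the 8-byte PNG signature. The helper below is a hypothetical sketch, not an SDK function.

```python
import base64

# Every PNG file starts with these eight bytes.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png_screenshot(screenshot_b64: str) -> bytes:
    """Hypothetical helper: decode a base64 screenshot and verify it is a PNG."""
    data = base64.b64decode(screenshot_b64)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded screenshot does not start with the PNG signature")
    return data


# Demo with a fabricated payload that only contains the signature plus filler.
sample = base64.b64encode(PNG_SIGNATURE + b"demo").decode("ascii")
print(len(decode_png_screenshot(sample)))  # 12
```

Calling decode_png_screenshot(result.screenshot) in place of the bare base64.b64decode would surface a truncated or mislabelled payload early.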

Custom headers and cookies

result = client.scrape(
    "https://example.com",
    headers={"Accept-Language": "es-AR"},
    cookies={"session": "abc123"},
    language="es-AR",
)

POST requests

from scrapingpros import MethodPOST

result = client.scrape(
    "https://api.example.com/data",
    http_method=MethodPOST(payload={"query": "test"}),
)

POST to a different URL than the navigation target

Some sites require navigating to one page (to set cookies / generate a session) and POSTing to a different endpoint (an internal API or GraphQL). Set MethodPOST.url:

result = client.scrape(
    "https://www.example.com/dashboard",  # navigation target
    http_method=MethodPOST(
        url="https://api.example.com/graphql",  # POST goes here
        payload={"query": "..."},
    ),
)

If MethodPOST.url is omitted, the POST goes to the same URL as the scrape (default behavior).

Form-encoded POST (OAuth2, legacy APIs)

OAuth2 grant_type=client_credentials and most legacy form-based APIs require application/x-www-form-urlencoded request bodies, not JSON. Set content_type="form":

from scrapingpros import MethodPOST

resp = client.scrape(
    "https://api.example.com/v1/oauth2/token",
    http_method=MethodPOST(
        payload={"grant_type": "client_credentials", "scope": "read"},
        content_type="form",  # default is "json"
    ),
    # base64_creds is the base64 encoding of "client_id:client_secret"
    headers={"Authorization": f"Basic {base64_creds}"},
)

Accepted values: "json" (default), "form". Available since v0.5.0.
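
The snippet above leaves base64_creds undefined, and the difference between the two content_type values is easiest to see side by side. The sketch below uses only the standard library to build Basic credentials (RFC 7617) and to show how the same payload looks as a form body versus a JSON body. It illustrates the wire formats and is not the SDK's serialization code. encode_basic_credentials and the client ID and secret values are placeholders.

```python
import base64
import json
from urllib.parse import urlencode


def encode_basic_credentials(client_id: str, client_secret: str) -> str:
    """Hypothetical helper: base64-encode "client_id:client_secret"."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


payload = {"grant_type": "client_credentials", "scope": "read"}

form_body = urlencode(payload)   # application/x-www-form-urlencoded
json_body = json.dumps(payload)  # application/json

# Placeholder credentials; substitute your own.
base64_creds = encode_basic_credentials("my-client-id", "my-client-secret")

print(form_body)  # grant_type=client_credentials&scope=read
print(json_body)  # {"grant_type": "client_credentials", "scope": "read"}
```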

Response fields

Every ScrapeResponse includes:

| Field | Type | Description |
| --- | --- | --- |
| content | str | Convenience: markdown if available, else HTML |
| html | str | Raw HTML (when format="html") |
| markdown | str | Clean text (when format="markdown") |
| statusCode | int | HTTP status from target page |
| executionTime | float | Seconds |
| extracted_data | dict | Extracted data (see Data Extraction) |
| evaluate_results | list | JS evaluation results (see JavaScript Execution) |
| screenshot | str | Base64 PNG string |
| guidance | ScrapeGuidance | Error analysis and next steps (see Response Guidance) |
| network_requests | list | Captured network activity |
| potentiallyBlockedByCaptcha | bool | CAPTCHA detection flag |
| timings | dict | Performance breakdown |